Support: surface hung tests on pytest session timeout#721
Merged
ChaoWao merged 1 commit intohw-native-sys:mainfrom May 8, 2026
Merged
Conversation
When the parent pytest dispatcher's session watchdog fired, the log showed only the timeout banner — the actually-stuck child subprocess was killed silently, its captured stdout discarded, and the child was left to be reaped as an orphan by the runner cleanup. Triaging the stuck case required guessing. Three changes make the timeout self-diagnosing: - parallel_scheduler: emit a `[scheduler] START label pid=... devices=...` line at launch (not only on completion). The last START without a matching PASS/FAIL identifies the stuck case. - parallel_scheduler: expose `_active_state` while `run_jobs` is in flight so the parent's signal handler can reach the live job table. - conftest: on session-timeout SIGALRM, send SIGUSR1 to every in-flight child (faulthandler, registered in `pytest_configure`, dumps all-thread Python+C stacks into the child's stdout — works even when the GIL is held by a native NPU call), let the pumps drain ~2s, then print each in-flight job's tail buffer in a `::group::HUNG ...` block, and finally call `_terminate_all` so children don't outlive us as orphans.
9525e7e to
fccdced
Compare
There was a problem hiding this comment.
Code Review
This pull request enhances the test session timeout handling by implementing a mechanism to debug hung child processes. It introduces a module-global state in the parallel scheduler to track active jobs, allowing the timeout handler to send SIGUSR1 signals to stuck processes for stack trace generation via faulthandler. The changes also include improved logging for job starts and GitHub Actions-style grouping for hung process output. I have no feedback to provide as there were no review comments.
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary
When the parent pytest dispatcher's session watchdog fired
(
--pto-session-timeout), the CI log only showed the timeout banner.The actually-stuck child subprocess was killed silently, its captured
stdout discarded, and the child was left to be reaped as an orphan by
the runner cleanup — see run 25541961688
for the symptom (9.5 min of dead air, then
[pytest] TIMEOUT, noindication which case was stuck).
This PR makes the timeout path self-diagnosing.
Changes
parallel_scheduler.py— emit[scheduler] START <label> pid=... devices=...at launch (not only at completion). The last START without a matching
PASS/FAIL identifies the hung case at a glance.
parallel_scheduler.py— expose_active_statewhilerun_jobsis in flight so the parent's signal handler can reach the live job
table.
conftest.py— on session-timeout SIGALRM:SIGUSR1to every in-flight child (faulthandler, registeredin
pytest_configure, dumps all-thread Python + C stacks into thechild's stdout — works even when the GIL is held by a native NPU
call, which Python-level watchdogs cannot reach);
output_lines;::group::HUNG ...block;
_terminate_allso children don't outlive us as orphans.What the log will look like next time
Normal: every case has a
[scheduler] STARTfollowed (out of order, inparallel) by a
::group:: ... [PASS|FAIL ...]block.On hang: the
STARTfor the stuck case has no matching PASS/FAIL, then[pytest] TIMEOUT: ..., then a::group::HUNG <label> pid=... elapsed=...block containing the all-thread traceback the child wrote when it caught
SIGUSR1. The existing CI retry (any non-zero rc → re-run with the pinned
PTO-ISA commit) is unchanged: a PTO-ISA regression that surfaces as a
hang is exactly the case that retry path is meant to recover.
Notes
pytest-timeoutper-test guard is intentionally not enabled; thesingle-layer "parent catches everything via SIGUSR1" path covers all
hang shapes (including native NPU deadlocks where pytest-timeout's
watchdog thread can't fire). If hangs become frequent enough that
burning the full session budget per incident is painful, adding
timeout = N/timeout_method = "thread"topyproject.tomlis apure increment on top of this PR.
guarded by
hasattr(signal, "SIGUSR1").Testing
time.sleep(9999)) under--pto-session-timeout 30; confirm the CI log shows::group::HUNG ...with traceback,rc=124, no orphan pythonprocesses after exit.
should look identical to before plus the new START lines).